A Bayesian Approach to Graphical Record Linkage and Deduplication
© 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.
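The bipartite representation described in this abstract can be illustrated with a small sketch. It assumes, purely for illustration, that each posterior draw from a sampler is stored as a label vector mapping every record to one latent individual. The function name `joint_match_probability` and the four draws below are hypothetical and are not taken from the paper. The sketch estimates the probability that a set of records refers to the same individual as the fraction of draws in which they share a label.

```python
from itertools import combinations


def joint_match_probability(samples, records):
    """Fraction of posterior draws in which all given records share one latent individual.

    samples: list of label vectors; samples[s][r] is the latent individual
             assigned to record r in draw s (hypothetical storage format).
    records: iterable of record indices (two or more).
    """
    records = list(records)
    agree = sum(1 for lam in samples if len({lam[r] for r in records}) == 1)
    return agree / len(samples)


# Made-up draws for four records; these are NOT results from the paper.
samples = [
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 0, 2, 2],
]

# Pairwise probabilities are the special case of two records.
pairwise = {
    (i, j): joint_match_probability(samples, (i, j))
    for i, j in combinations(range(4), 2)
}

# A three-way probability uses the same function with three records.
three_way = joint_match_probability(samples, (0, 1, 2))
```

Because records are only linked through latent individuals, every draw is automatically transitive: if records 0 and 1 share a label and records 1 and 2 share a label, then 0 and 2 do as well, so no separate reconciliation step is needed.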
Generalized Bayesian Record Linkage and Regression with Exact Error Propagation
Record linkage (de-duplication or entity resolution) is the process of
merging noisy databases to remove duplicate entities. While record linkage
removes duplicate entities from such databases, the downstream task is any
inferential, predictive, or post-linkage task on the linked data. One goal of
the downstream task is obtaining a larger reference data set, allowing one to
perform more accurate statistical analyses. In addition, there is inherent
record linkage uncertainty passed to the downstream task. Motivated by the
above, we propose a generalized Bayesian record linkage method and consider
multiple regression analysis as the downstream task. Records are linked via a
random partition model, which allows for a wide class to be considered. In
addition, we jointly model the record linkage and downstream task, which allows
one to account for the record linkage uncertainty exactly. Moreover, one is
able to generate a feedback propagation mechanism of the information from the
proposed Bayesian record linkage model into the downstream task. This feedback
effect is essential to eliminate potential biases that can jeopardize the resulting
downstream task. We apply our methodology to multiple linear regression, and
illustrate empirically that the "feedback effect" is able to improve the
performance of record linkage.
Comment: 18 pages, 5 figures
Using metric space indexing for complete and efficient record linkage
Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality-sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.
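The conventional three-step process mentioned in this abstract (indexing, comparison, classification) can be sketched as follows. This is a minimal illustration of the traditional pipeline, not of the MSI method the paper proposes. The blocking key (first letter of the name), the similarity measure (`difflib.SequenceMatcher`), the threshold of 0.8, and the sample records are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def link(records, threshold=0.8):
    """Return sorted pairs of record ids classified as links.

    records: dict mapping record id -> name string (hypothetical schema).
    """
    # Step 1, indexing: group records by a crude blocking key (first letter).
    blocks = defaultdict(list)
    for rec_id, name in records.items():
        blocks[name[0].lower()].append(rec_id)

    links = []
    for ids in blocks.values():
        # Step 2, comparison: score each pair within the same block.
        for a, b in combinations(sorted(ids), 2):
            sim = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
            # Step 3, classification: link if similarity reaches the threshold.
            if sim >= threshold:
                links.append((a, b))
    return sorted(links)


# Made-up records for illustration only.
records = {
    1: "Jon Smith",
    2: "John Smith",
    3: "Julia Roberts",
    4: "Maria Rossi",
    5: "Smith, John",
}

links = link(records)
```

In this sketch, record 5 is never compared with records 1 or 2 because it falls into a different block. That shows how an indexing step can miss a candidate pair, which is one of the drawbacks the abstract attributes to indexing techniques, and it also shows why the blocking key and threshold normally require domain knowledge to configure.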